Semi-deconvolution of bulk and single-cell RNA-seq data with application to metastatic progression in breast cancer

Abstract Motivation Identifying cell types and their abundances and how these evolve during tumor progression is critical to understanding the mechanisms of metastasis and identifying predictors of metastatic potential that can guide the development of new diagnostics or therapeutics. Single-cell RNA sequencing (scRNA-seq) has been especially promising in resolving heterogeneity of expression programs at the single-cell level, but is not always feasible, e.g. for large cohort studies or longitudinal analysis of archived samples. In such cases, clonal subpopulations may still be inferred via genomic deconvolution, but deconvolution methods have limited ability to resolve fine clonal structure and may require reference cell type profiles that are missing or imprecise. Prior methods can eliminate the need for reference profiles but show unstable performance when few bulk samples are available. Results In this work, we develop a new method using reference scRNA-seq to interpret sample collections for which only bulk RNA-seq is available for some samples, e.g. clonally resolving archived primary tissues using scRNA-seq from metastases. By integrating such information in a Quadratic Programming framework, our method can recover more accurate cell types and corresponding cell type abundances in bulk samples. Application to a breast tumor bone metastases dataset confirms the power of scRNA-seq data to improve cell type inference and quantification in same-patient bulk samples. Availability and implementation Source code is available on Github at https://github.com/CMUSchwartzLab/RADs.


Introduction
Computational methods for resolving single-cell clonal evolutionary dynamics (Greaves and Maley, 2012) have become a central part of modern cancer genomics research as ever more powerful genomic tools have become available and as the role of tumor heterogeneity and clonal evolution in cancer progression have become more apparent (Beerenwinkel et al., 2016). The fundamental goal of such methods is to characterize the genetics and genomics of tumor cells and various other cell types infiltrating them and understand how these populations of cells predict future tumor progression and evolve over its course. Numerous variations on this basic framework have been developed, for different kinds of genomic data (e.g. DNAseq versus RNA-seq, bulk versus single-cell) or different research questions (e.g. understanding genetic versus phenotypic evolution) (cf., Schwartz and Schä ffer, 2017). In the present work, we focus on one specific scenario: understanding RNA evolution in settings in which bulk and single-cell data are available for different time points, samples or stages of progression.
Prior to the emergence of practical single-cell methods, most techniques for reconstructing clonal evolution depended on genomic deconvolution (Lu et al., 2003), in which one seeks to resolve clonal evolution by computationally inferring activities of homogeneous cell populations from mixtures of genomic data contained in bulk samples (Schwartz and Shackney, 2010). Methods for this problem can roughly be classified into two classes: partial deconvolutional algorithms, which interpret data in terms of a reference matrix of known cell types, and complete deconvolutional algorithms, which infer cell types de novo by comparison of multiple samples. Examples of partial deconvolution methods include CIBERSORT, which makes use of a pre-defined LM22 profile matrix and CIBERSORTx (Newman et al., 2019), which derives a profile matrix for interpreting bulk data using reference singlecell data. See Avila Cobos et al. (2020) for a comparison of different partial deconvolution algorithms and related data transformation methods. Examples of complete deconvolution methods include Geometric Unmixing (Schwartz and Shackney, 2010), which proposed an archetype analysis method based on geometries of genomic point clouds; DSA (Zhong et al., 2013), which treats deconvolution as a matrix factorization problem; LinSeed (Zaitsev et al., 2019), which identifies a set of anchor genes through linear correlation and uses DSA to solve for the non-anchor genes; NND (Tao et al., 2020a), which poses partial deconvolution problem as a matrix factorization to be solved with i386 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Bioinformatics, 38, 2022, i386-i394 https://doi.org/10.1093/bioinformatics/btac262 ISCB/ISMB 2022 gradient descent implemented through a neural network; and RAD (Tao et al., 2020b), which solve s the formulation of NND using a hybrid optimizer with improved accuracy and speed.
Single-cell genomics has rapidly displaced deconvolutional methods for tumor genomic analysis, particularly for RNA-seq variants, as single-cell sequencing has become reliable and cost-effective (cf. Kuipers et al., 2017;Lim et al., 2020). Although single-cell sequencing introduces its own computational complications and data quality issues, having large numbers of direct single-cell measurements leads to substantially greater resolution for single-cell variation than is possible for deconvolutional methods even with high quality bulk data. Nonetheless, deconvolutional methods remain necessary in practice for a variety of real-world use cases. Single-cell data is not typically possible for older archived samples and the field still lacks large cohorts of single-cell tumor genomic data comparable to bulk resources such as the influential Cancer Genome Atlas (Chang et al., 2013) or International Cancer Genome Consortium (Zhang et al., 2011) datasets. The problem is particularly acute for current studies of patients being tracked longitudinally, e.g. in using clonal phylogenetics to understand metastatic progression (Naxerova and Jain, 2015). Patients being seen today for metastatic disease may have been first diagnosed years earlier and characterizing their tumors' evolution can require comparing recent samples for which single-cell data is practical with archived samples for which only bulk data is possible.
This work was developed specifically to address scenarios such as this, in which one seeks to understand longitudinal progression of a cancer by comparing samples some of which may be amenable to single-cell methods (e.g. recent metastases) and others that can only be examined by bulk methods (e.g. an archived primary tumor biopsy). It accomplishes this by developing a hybrid algorithm to infer genomics and clonal frequencies in both bulk and single-cell samples, using single-cell data in some samples as partial references for bulk data in others. In this regard, it follows a strategy of mixed bulk and single-cell data previously applied in tumor evolutionary studies primarily with single-cell and bulk DNA-seq (Lei et al., 2020;Malikic et al., 2019a,b;Salehi et al., 2017) or using bulk DNA-seq to guide interpretation of single-cell RNA-seq data (McCarthy et al., 2020;Shafighi et al., 2021). Bulk and single-cell RNA-seq has been previously combined in bMIND (Wang et al., 2021), although with a different goal of using paired data from single samples to better reconstruct cell type profiles. Reference single-cell RNA-seq has also been used to train neural networks to deconvolve independent bulk RNA-seq data (Menden et al., 2020). Our method poses the problem using a matrix factorization formulation comparable to earlier bulk deconvolution methods, drawing on ideas from hybrid bulk/single-cell DNA methods to integrate the heterogeneous data sources. We demonstrate through simulated data and real paired primary and metastatic patient data that the method is effective at resolving cell population dynamics across time points in comparison to more traditional deconvolution methods, providing novel insight into how cell population dynamics can underlie tumor progression.

Materials and methods
Our method can be divided into two steps: First, we infer a profile matrix S to represent the single-cell data from metastatic samples. Then, we apply a semi-deconvolution algorithm that uses elements of both partial and complete deconvolution algorithms to avoid weaknesses of each. The objective/loss function consists of two parts: a complete deconvolution part, which by itself would introduce error due to the difficulty of finding a low-dimensional approximation for a noisy high dimensional mixture, and a partial deconvolution part, which by itself would introduce error due to imprecise approximation of the bulk data by single cells gathered from different samples and technologies. By combining these terms, we seek to mitigate the complementary weaknesses of each approach. The approach is particularly effective for characterizing how population frequencies of distinct cell types evolve over progression stages, e.g. via differential immune infiltration (cf. Sturm et al., 2019). Unlike other methods also using reference profiles, our method allows the inference of cell types not found in the single-cell samples. We call our method RAD with single cells (RADs). The operation and a high-level description of the method are summarized in Figure 1.

Mathematical formulation of deconvolution
We aim to achieve better deconvolution by integrating information from the noisy but informative single-cell RNA-seq via the following constrained optimization objective: min C;F;l jjB À CFjj 2 Fr þ kjjC 1 À lSjj 2 Fr ; (1) s:t: C il ! 0; i ¼ 1; . . . ; m; l ¼ 1; . . . ; K; (2) F lj ! 0; l ¼ 1; . . . ; K; j ¼ 1; . . . ; n; (3) B 2 R mÂn !0 is the bulk RNA-seq data and S 2 R mÂk !0 is a preprocessed matrix of single-cell reference data, where each row is a gene, each column of B is a bulk sample and each column of S is a cell type/ population. k is the penalty weight that suppresses the discrepancy between single-cell data and the inferred expression profile matrix and l adjusts the scale of inferred single-cell references. Additional parameters of Equations (1-4) and other auxiliary variables are defined in Table 1.

Solving the deconvolution problems
The original problem formulation Equations (1-4) is non-negative. Inspired by previous work, we use a coordinate descent method to obtain a possibly suboptimal solution to the problem. The coordinate descent algorithm iteratively repeats three phases described below until convergence, where in each phase, the computational problem can be re-formulated to a tractable problem class, e.g. quadratic programming or linear regression.
Phase 1: Optimizing fraction matrix F At this phase, we fix C and l to optimize F. Then, Equations (1-4) is equivalent to: s:t: F lj ! 0; l ¼ 1; . . . ; k; j ¼ 1; . . . ; n; Let f ¼ F Áj be the column in Equation (5) representing the frequency of each cell type/population in one bulk sample B Áj (defined as b). Then we can re-formulate it to minimize the error for each column of B À CF:
Phase 2: Optimizing cell type matrix C. At this phase, we fix F and l to optimize C, then the Equation (1) equals to: Given that not all primary cell types migrate to metastases, there may be cell types found in the primary tumor but not in single-cell reference samples. We therefore allow the primary tumor to have additional cell types not found in the metastasis, by decomposing C into C 1 , representing cell types found in single-cell samples, and an auxiliary matrix C 2 , representing additional cell types present only in bulk samples (Table 1). C ¼ ½C 1 ; C 2 is the horizontal concatenation of C 1 and C 2 : . . . : : . . . : : . . . Then Equation (14) can be rewritten as: We follow coordinate descent to optimize C 1 and C 2 iteratively: Phase 2.1: Optimizing C 1 . Let c 1 ¼ C 1 > iÁ , which represents the expression profile of one gene in each cell type/population.
to represent the transpose of one row of B and S, respectively. Let F ¼ ½F > 1 ; F > 2 > be the vertical concatenation of F 1 and F 2 . Then we can re-formulate Equation (16) to a standard form for C 1 :  Total cell types (m gene Â K cell types) in primary tumor; note K ¼ k þ y C S mÂk Expression profile (m gene Â k cell types) in metastatic tumor prior to zero-inflation corrections S mÂk Representative reference (m gene Â k cell types) from single-cell data in metastases l Scaling factor for S k Penalty term to balance information from S () min ðC1Þ iÁ jjB iÁ À ðC 1 Þ iÁ F 1 jj 2 2 þ kjjðC 1 Þ iÁ À lS iÁ jj 2 2 ; ð8i ¼ 1; . . . ; m; we then omit subscript i for clarityÞ with the same constraint as shown in Equation (17), where I is an identity matrix, which implies iÁ , which represents the expression profile of one gene in each cell type/population. Similarly, we define b ¼ B > iÁ to represent the transpose of one row of B. Then we can re-formulate Equation (16) to a standard form for C 2 : We found that Equation (32) has similar form to Equation (13). It is also not hard to see that ðF 2 F > 2 Þ > ¼ F 2 F > 2 , so the term F 2 F > 2 is symmetric, which also indicates that Equation (32) is a Quadratic Programming problem with the same constraints as Equation (18).
Phase 3: Optimizing scaling factor l. At this phase, we fix F and C 1 to optimize l. Then Equation (1) equals to: Since l is a scalar and C 1 and S are two known matrices with the same dimension of m Â l, then Equation (33) can be viewed as a Linear Regression problem of which the goal is to best fit the model regarding the independent variable S to the data C 1 by using the least-squares method to find the optimal value of l that minimizes the residual sum of squares: In summary, we have shown that the deconvolution problem can be divided into three main phases. In each phase, we have formulated the sub-problem to be either a QP problem or a Linear Regression problem. With third-party software (e.g. CVXOPT), we iteratively solve each phase by using the coordinate descent algorithm until the convergence to get optimal values for F; C 1 ; C 2 and l.

Simulated datasets
It is not possible to establish with certainty the ground truth for any real deconvolution dataset and so we rely partially on simulated data for validation. Our main strategy is to rely on true single-cell RNA-seq data from paired primary and metastatic breast cancer samples to generate artificial bulk data composed of mixtures of single cells, for which we would then have a known ground truth. We simulated bulk RNA-seq of primary tumor tissue samples based on the following assumptions: (i) the underlying gene expression profile of each cell type in bulk RNA-seq would be similar to the average expression values of cells belonging to the same type in single-cell RNA-seq had dropout events in single-cell RNA-seq not occurred; and (ii) the fractions of different cell types in metastatic tissue samples will be different from those in primary tumor tissue samples.
First, we inferred the cell type expression matrix of bulk RNAseq from the real single-cell RNA-seq data of metastatic tumor tissue samples. The single-cell RNA-seq results from two distinct metastases of the tumor, BoM1 and BoM2, were normalized to CPM (counts per million) before being aggregated. For each cell type, the aggregated cell expression values were averaged to obtain its gene expression profile. The gene expression profile was normalized to CPM again before being corrected for the effect of single-cell RNAseq drop-out events (see Section 2.3.3).
The corrected gene expression profile then served as the ground truth cell type expression matrix for simulating the bulk RNA-seq samples. In total, 5000 randomly selected genes and all k cell types from the gene expression profile were log 2 transformed and replicated n times to generate each of the n primary tumor tissue sample's component matrix. Inter-sample noise was added to the component matrices by replacing each of jth gene's expression value x with one instance drawn from the Gaussian distribution Nðx; r j =5Þ, where r j is the standard deviation of jth gene's expression values across cell types in the log 2 space. The noisy component matrices were then projected back to the linear space. The fraction vector of the k cell types in these primary tumor samples was generated by perturbing the frequency of each cell type in the real singlecell RNA-seq of metastatic tissue samples in log 2 scale: each of the k frequencies f i was replaced by one instance drawn from the Gaussian distribution Nðf i ; r f =2Þ, with r f being the standard deviation among the k cell types' frequencies. This fraction vector was then replicated n times before the addition of inter-sample noise in the same way as the cell type-gene expression component matrix. The fractions for all k cell types were then summed to normalize the matrix, resulting in a n Â k fraction matrix where each of the n fraction vector sums to 1.
Finally, the n simulated bulk RNA-seq of primary tumor tissue samples were generated by convolving each component matrix with its corresponding fraction vector.

Bulk and single-cell RNA-seq datasets
We tested the method on a real dataset consisting of bulk RNA-seq derived from a formalin-fixed parafin embedded primary breast tumor sample and single-cell RNA-seq data derived from two bone metastases. Statistics on the dataset used in the present work can be found in Table 2.

Preprocessing to construct S
The single-cell RNA-seq experiment produces an n s Â m expression matrix for m genes and n s cells sequenced in total. Using known biomarkers for different cell types, each of the n s cells could be annotated with one of the k cell types identifiable. After normalizing the expression values into CPM, the k Â m matrix C s can be calculated by averaging expression values for each of the m genes across cells with the same cell type label.
However, single-cell RNA-seq data includes noise and bias from dropout events, where a gene moderately expressed in some cells failed to be detected in other cells. These drop-outs were hypothesized to be the result of failed reverse transcription during the sequencing experiment, as suggested in previous attempts at quantifying mRNA at single-cell level using RT-qPCR (Bengtsson et al., 2008). These dropout events require additional correction to guide the deconvolution of dropout-free bulk RNA-seq meaningfully. In general, the smaller an expression value is, the higher its probability of being undetected in the experiment will be.
To correct for the effect of dropout events, we assume that the relationship between the jth gene's average expression value l j and its dropout frequency follows a Michaelis-Menten function (Andrews and Hemberg, 2019): where K j M is the Michaelis constant representing the mean expression value of jth gene required for half of the cells to be detected. Therefore, we could fit the Michaelis-Menten function with mean expression values and percentages of values dropped from the real single-cell RNA-seq data to estimate the K M for each gene. Then given any average expression value for the cell type, it would be possible to estimate P dropout using the Michaelis-Menten function. The profile expression vector for the jth gene measured in a bulk RNAseq experiment S j 2 R 1Âk !0 can then be inferred as:

Evaluation
We evaluated the performance of our method by comparing the inferred expression profilesĈ and corresponding fractionsF with the ground truth C and F by utilizing four different metrics: R 2 C (Pearson coefficient ofĈ and C), L 1 loss (jjĈ À Cjj 1 =jjCjj 1 ), R 2 F (Pearson coefficient ofF and F) and MSE (mean square error ofF and F). Except where otherwise noted, we used k ¼ 0:1, chosen empirically to give good results across a variety of datasets.

Our deconvolution is unbiased and robust on simulated data
First, we tested our method on the simulated data generated as discussed in Section 2.3.1. We generated a single-cell matrix C with size of m ¼ 5000 random genes and k ¼ 5 known cell types as well as a profile matrix S from the current Single-cell RNA (scRNA) dataset. We also allow y ¼ 1 unknown cell type that only exists in the bulk sample. Noise has been introduced to bulk and single-cell samples to mimic the relatively low resolution in bulk sequencing and individual-level differences in single-cell RNA-seq data, respectively. When the number of samples increases, the average performance of pure RAD becomes slightly better while the variance is still large. We also find that in most cases, adding a profile matrix S further improves the inference by reducing the L 1 loss (Fig. 2a). In the current experiment, we used S as the initialization for both RAD and RADs. This is equivalent to feeding prior information to RAD, which is usually not the case for RAD since it assumes that no single-cell information available. Adding such priors could yield faster and better convergence in some cases.
The inference of F shows a similar pattern to that for C, although they are not sensitive to small number of samples (Fig. 2b). This is not surprising since F has more constraints and the search space of both is much smaller than C. Noise was also introduced (b noise and s noise are not 0), and we then find that cell component deconvolution with S outperforms pure RAD and that of cell component deconvolution without S in all cases as well, and it is robust to the noise (Fig. 2). Based on the results, we can conclude that adding information from C (e.g. using profile matrix S) can help in bulk sequencing data deconvolution particularly with small numbers of bulk samples.

Comparison with other methods
In this section, we compared our method to two other popular deconvolution methods: DSA, a complete deconvolution method requiring a list of highly expressed genes corresponding to each cell type; and CIBERSORTx, a semi-deconvolution method with reference gene expression profiles from other tissues. There are some restrictions for these methods, e.g. DSA cannot work when there is only one bulk sample to deconvolve and CIBERSORTx requires the number of bulk samples to exceed the number of cell types. In  2. Performance of deconvolution when bulk samples are limited. We show representative results of RAD, RADs without reference, RADs with reference for number of bulk samples n ¼ 1 while number of cell type K ¼ 6 (five known cell types and one unknown cell type). We compared the performance both on estimated C and F using the evaluation metrics L 1 loss, R 2 C for C and MSE, R 2 F for F, respectively. (a) RAD using S as initialization, (b) RADs without S (k ¼ 0) and (c) RADs with S (k ¼ 0:1) addition, neither of these two methods is able to infer information regarding unknown cell types absent in the reference gene list or expression profiles.
In order to make the comparison reasonable, we ran two sets of experiments. In the first experiment, we ran DSA on a small number of bulk samples (e.g. n ¼ 2, 4). Note that DSA can only infer the cell types that are present in S so the L 1 loss and MSE were only calculated on the known cell types, although our method includes both known and unknown cell types. We find DSA to have similar performance to RAD and RADs without reference, which is worse than that of RADs with reference in cases with or without noise (Fig. 3,  red boxes).
In the second experiment, we increased the number of simulated bulk samples to be 7, but varied the number of known and unknown cell types to be 5:1, 4:2 and 3:3, respectively. This setting is intended to align with requirement of CIBERSORTx and also allow us to investigate the effects of known information on the deconvolution performance. We find that DSA and CIBERSORTx perform worse than our method (Fig. 4d, different boxes). This might be due to the fact that DSA and CIBERSORTx can only infer the cell types in S and any unknown cell types then are included in the bulk sample profiles as noise rather than independent components in the deconvolution result, while our method considered both known and unknown cell types in the bulk samples and separated them in the deconvolution result. We also find that when there are more unknown cell types, our method still works but the performance became worse (Fig. 4d), different X labels). This is not surprising since more unknown cell types means less useful information in S, which makes the penalty term in Equation (1) less effective. We further note that the problem is ill-posed if the number of unknown cell types exceeds the number of bulk samples. This also indicates the importance of correct reference information in the bulk deconvolution.

Systematic changes in the breast cancer metastatic microenvironment
In this section, we applied our deconvolution method to the real data described in Section 2.3.2. We first retrieved genes that were Average results of RAD, RADs without reference, RADs with reference for number of bulk samples n ¼ 1, 2, 4 and number of cell types K ¼ 6 (five known cell types and one unknown cell type) with different levels of noise (b_noise ¼ 0.0, 0.1, s_noise ¼ 0.0, 0.1). Performance was calculated across 10 replicates, using L 1 loss and MSE for C and F, respectively. (a) Performance on C inference, (b) Performance on F inference. Different boxes show RAD with S as initialization, RADs without S, RADs with S and DSA, respectively (see Section3.2 for details; note that DSA cannot work when bulk sample n ¼ 1) Fig. 4. Performance compared with DSA and CIBERSORTx (replicates ¼ 10). Deconvolution on seven simulated bulk samples without noise using six cell types. Unknown cell type(s) were set to be 1, 2 and 3, respectively. (a-c) representative result for DSA, CIBERSORTx and RADs when there are three unknown cell types. (d) Average performance on F (left) and S (right) inference from the three methods. Metrics of DSA and CIBERSORTx were calculated only on the known cell types in the inferred results and ground truth while sRAD considered both known and unknown cell types differentially expressed across different cell types in the single-cell RNA-seq data. For each gene, its log 2 -scaled expression values in all cells annotated to be one cell type was compared with those in all other cells. A Wilcoxon rank-sum test was performed during the comparison to determine if the gene was significantly differentially expressed in the foreground cell type compared with the background. A Benjamini-Hochberg corrected P-value cutoff of 0.001 yielded a total of 1226 genes differentially expressed across all six cell types annotated in the single-cell RNA-seq. Then, we checked with the bulk sample to make sure that the same set of genes is available for both bulk-and single-cell samples, which finally yielded 1196 genes. The expression profiles of the 1196 genes from the bulk sample were then retrieved to compose B, and those from single-cell samples were retrieved to compose C S . The profile matrix S was inferred from C S as described in Section 2.3.3. B and S are the input for Equation (1). We also allow for one cell type (C 2 ) to represent the cell type only found in primary tumor as well to balance the difference between bulk and single-cell sequencing data.
The inferred gene expression profiles from bulk samples were found to match very well with the profile matrix S from single-cell samples. This indicates that the penalty term in Equation (1) works with a non-zero regularization term (Fig. 5a, k ¼ 0:1) in real data. The heat-map showing the gene expression also yields distinct expression patterns for different cell types (Fig. 5b). However, fractions of cell types exhibit different patterns between primary site and metastatic sites. For example, the fibroblasts cell type takes a low proportion in the primary site, but makes up for a large one in the metastatic sites. Although the only bulk sample available might have an unrepresentative low fibroblast content at the specific location where the tissue was sampled (e.g. high fibroblasts may be needed for the tumor to attach to the bone), the large proportion of fibroblasts in the bone metastases is consistent with previous studies claiming that cancer-associated fibroblasts contribute to tumor growth, invasion and metastasis (Joshi et al., 2021;Kalluri and Zeisberg, 2006), which leads to cancer malignancy in later stages. The fraction of lymphocytes also exhibits an interesting pattern. The primary sample shows a higher fraction than the average fraction in bone metastatic samples (Fig. 5, rightmost versus leftmost bar) although BoM1 includes more lymphocytes than the primary sample, whereas BoM2 includes fewer. However, the average of BoM1 and BoM2 shows a lower fraction than the primary sample (Fig. 5, leftmost bar). We also find that there are almost no myeloid cells in the primary tumor (fraction ¼ 1:3 Â 10 À4 ) but some in bone metastases. A closer examination found that among the single-cell samples labeled as myeloid cell, macrophages occupy a large proportion (77.4%, from a finer cell-type annotation on the same dataset, data not shown). All these findings are consistent with prior work, which concluded that metastatic breast cancers show reduced immune cells but increased macrophages compared with primary tumors (Zhu et al., 2019). We also found epithelial cells relatively preserved in both primary and metastatic tumors. This is consistent with breast cancer origin from non-diseased epithelial tissue and transition to metastasis (Nguyen et al., 2018;Wang and Zhou, 2011). The fractions of osteoclasts can be used as negative controls since they are bone-tissue related cell types that should be rare in primary sites, and indeed they exhibit fractions close to zero in the primary tumor. The unknown cell types, meaning inferred cell types not found in the single-cell data, account for over 30% in the primary tumor. Although this unknown type is close to endothelial (Fig. 5c) based on the expression distance, we would not assign any biological meaning to it until we have better evidence (e.g. comparing to single-cell data from the same primary tumor or other reliable public data). We interpret it as a free component that accounts for the difference between primary and metastatic tumor composition.
We then explored the inferences from the perspective of variations in inferred single-cell gene expression and their interpretation with respect to Gene Ontology (GO) enrichment pathways. We first show that S extracted from real data is a good representative of the true single-cell data. It has low average distance to the expression profiles of each of the single-cell samples and shows high correlation with such gene expression profiles ( Fig. 6a and b). Although S represents the available single-cell samples well while still allowing us to infer a reasonable C at the cell-type level, some variations in gene expression were observed between inferred primary tumor C and the single-cell RNA-seq data measured from each of the two bone metastases. This may suggest changes in gene expression at the single-cell level along the metastatic trajectory of the tumor. The top 1% most down-regulated and up-regulated genes in each metastatic sample, measured by their distances from the inferred expression values for the primary tumor sample, were selected for downstream GO enrichment analysis (Fig. 6c). The top up-regulated genes in both metastatic samples showed significant enrichment for the receptor binding of several chemokines that have been found to have a direct impact on metastasis, including CCR5 and CCR1 (Mollica Poeta et al., 2019). This is consistent with previous reports that increased in CCR5 activity leads to increased homing behavior to metastatic sites in breast cancer (Jiao et al., 2021). Similarly, the knock-down of CCR1 was experimentally shown to inhibit metastasis of breast cancer (Shin et al., 2017). The up-regulation in both metastatic samples could also have the effect of promoting chemotaxis of natural killer cells that may be recruited in response to increased activity of chemokines CCR1 and CCR5 (Aldinucci et al., 2020). Additionally, increased expression of genes enriched for binding of IgG, such as is observed with BoM2, has been hypothesized to be a driver of breast tumor metastasis (Cui et al., 2021).

Discussion
In this work, we develop a novel tumor deconvolution method, RADs, designed for scenarios in which we have mixtures of bulk and single-cell data. We show that the method improves on standard bulk deconvolution or reference-based variants and works well even when there are limited numbers of bulk samples. Although our formulation of deconvolution using single-cell data from the same patient alleviates some bias induced by independent single-cell reference profiles, the formulation still implicitly requires all potential populations be available in the single-cell data. The extension of C to unknown populations not available in single-cell data can eliminate another source of inaccuracy in reference-based deconvolution. This heuristic idea not only has a biological meaning, such that some cell types in the primary tumor might not migrate to other organs or tissues, but also lends itself well to the coordinate descent algorithm. Unlike other methods, our method directly takes advantage of single-cell data from metastatic sites of the same patient, providing a more accurate reference than those from unrelated normal tissue or a panel of reference samples. In addition, our method can work on a very limited number of bulk samples and still achieve reasonable results when the number of inferred cell types exceeds the number of bulk samples, normally a hard problem for complete deconvolution. The results on the real data align well with existing work on the breast cancer bone metastases as described in Section 3.3.
There are some limitations of our method, however. The method is still limited in its ability to infer new cell types with few bulk samples, making it difficult to characterize progression via novel evolution or complete loss of clones. It would benefit from further evaluation on this point and on its ability to discriminate similar cell types with finer resolution. It additionally depends on some model parameters for which we so far lack principled methods for automated selection. Also, our method requires matched bulk and single-cell samples from the same patient, even if they can come from different sites and progression stages. Although this is motivated by an important use case, the combination of data is still not common. Future work will explore how same-patient and thirdparty reference data may be synergistic in bypassing this limitation. Nevertheless, we believe our method builds a bridge between limited data and good deconvolution performance, and provides strategies for better leveraging heterogeneous data modalities that may have broader applications in cancer research and other single-cell biology.

Funding
This work was partially supported by National Institutes of Health awards [R21CA216452 and R01HG010589]; Pennsylvania Department of Health award [No. 4100070287]; Susan G. Komen for the Cure; the Mario Lemieux Foundation; and the Breast Cancer Alliance. It was also partially supported by the AWS Machine Learning Research Awards granted to J.M. and R.S. and by the Center for Machine Learning and Health Fellowship granted to Y.T. The Pennsylvania Department of Health specifically disclaims responsibility for any analyses, interpretations or conclusions.